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Introduction 

This  project  is  designed  to  address  the  subject  of  mammary  cancer  development.  The  purpose  of  the  project  is 
to  investigate  molecular  events  occurring  in  the  preclinical  stages  of  mammary  cancer;  the  results  may  lead  to 
insights  into  cancer  prevention  in  the  future.  Specifically,  the  project  investigates  the  intersection  between 
genome  demethylation,  retrotransposon  transcriptional  activity,  and  retrotransposon-driven  transcription  of 
cellular  genes.  Retrotransposon  promoters  are  well  recognized  to  function  as  alternative  promoters  for  different 
cellular  genes,  generating  chimeric  transcripts  that  may  or  may  not  function  in  the  same  way  as  transcripts  from 
the  regular  gene  promoter.  Transcriptional  activation  of  retrotransposons  is  strongly  linked  with  their  CpG  DNA 
methylation,  and  global  genomic  demethylation  is  one  of  the  commonest  molecular  changes  in  malignancies. 
The  project  tests  the  hypothesis  that,  in  preclinical  stages  of  tumour  development,  progressive  genomic 
demethylation  leads  to  increased  transcriptional  activity  of  retrotransposons  and  this,  in  turn,  leads  to 
transcription  of  otherwise  silent  genes,  potentially  setting  up  molecular  conditions  that  favour  cancer 
development.  Our  collaborator,  Dr.  Anne  Peaston,  developed  a  genetically  engineered  mouse  model  in  which  a 
specific  mammary  cell  population  is  fluorescently  marked  upon  initial  transcriptional  activation  of  the  SV40 
large  T  antigen  (SV40Tag)  oncogene.  SV40Tag  is  transcriptionally  activated  during  pregnancy  and  lactation, 
and  the  mice  are  predisposed  to  develop  mammary  cancer  after  3  pregnancies  and  lactations.  Using  this  model, 
populations  of  marked  cells  can  be  collected  for  integrated  analysis  of  gene  expression,  promoter  usage,  and 
DNA  methylation  after  defined  amounts  of  exposure  to  SV40Tag  during  different  stages  of  preclinical  cancer 
development. 

Body 

The  relevant  sections  from  the  Statement  of  Work  are  shown  in  the  table  below  with  corresponding  goals  and 
results.  A  no  cost  extension  was  received  to  continue  this  work  in  year  3  in  light  of  some  of  the  technical  issues 
with  the  mouse  breeding  experiments  (see  Dr.  Peaston’ s,  collaborating  PI,  report  for  more  information).  While 
waiting  on  the  shipments  of  DNA  samples  to  begin  processing  them,  we  have  worked  on  methodological 
improvements  to  streamline  sample  preparation  and  began  to  establish  an  analysis  pipeline  that  can  handle  the 
three  data  types  that  will  be  produced  in  this  proposal.  This  pipeline  will  greatly  facilitate  final  analysis  and 
interpretation  of  results  for  this  project  in  the  upcoming  year.  These  advancements  and  their  importance  to  the 
project  are  described  below.  In  particular  we  feel  confident  that  the  streamlined  protocol  and  analysis  tools  for 
Methyl-MAPS  that  we  have  developed  will  easily  allow  us  to  complete  all  Methyl-MAPS  analyses  and  final 
computational  analyses  outlined  in  the  Statement  of  Work  before  the  project  end. 


Year  1  &  2:  Items  from  Statement  of  Work  Relevant  to  Edwards  Lab. 


Year, 

Months 

Goal 

Result 

Y1 

1-3 

4-6 

7-9 

10-12 

Y2 

1-3 

4-6 

7-9 

10-12 

4. 

9. 

5. 

7. 

8. 

8. 

9,10,11. 

8. 

•  Set  up  schedule  for  formal  monthly 
electronic  lab  meeting  between  Peaston  lab 
and  Edwards  lab.  And  regularly  hold 
meetings. 

•  An  informal  schedule  was  set  up  with  a  plan 
for  regular  meetings  to  commence  after  the 
first  DNA  shipments  are  sent 

Y1 

4-6 

6. 

•  Preliminary  Methyl-MAPS  analysis  of  pilot 
virgin  samples 

•  Initial  DNA  shipment,  has  been  received  and 
libraries  are  under  construction. 

Y1 

10-12 

7. 

•  Methyl-MAPS  library  preparation  and 
sequencing  for  replicate  #1  uniparous  & 
triparous  control  and  tumor-prone 

•  Initial  DNA  shipment,  has  been  received  and 
libraries  are  under  construction. 

4 


Y2 

1-3 

6. 

7. 

•  Preliminary  Methyl-MAPS  analysis  of 
replicate  #1 

•  Review  DNA  loci  of  interest  for  bisulfite 
sequencing  in  light  of  PCR  results  and 
library  analysis,  Continue  bisulfite 
sequencing  assessment  of  DNA  loci  of 
interest 

•  Once  libraries  are  made  they  will  be 
sequenced,  analysis  pipeline  is  ready  to  go. 

Y2 

3. 

•  Methyl-MAPS  library  preparation  and 

•  We  are  awaiting  shipment  of  the  relevant 

4-6 

sequencing  for  replicate  #2  uniparous  & 

material. 

5. 

triparous  control  and  tumor-prone 

•  Library  analyses  replicate  #2  and 

•  Analysis  pipelines  have  been  developed  and 

preliminary  comparisons  with  replicate  #1 

are  in  place  for  when  data  is  generated. 

Y2 

4. 

•  Methyl-MAPS  library  preparation  and 

•  We  are  awaiting  shipment  of  the  relevant 

7-9 

sequencing  for  replicate  #3  uniparous  & 

material. 

triparous  control  and  tumor-prone 

Y2 

2. 

•  Finalize  data  analysis 

•  Analysis  pipelines  have  been  developed  and 

10-12 

3. 

•  Prepare  a  report  for  publication  of  the  results 

are  in  place  for  when  data  is  generated. 

5. 

Reports  will  be  finalized  as  the  data  is. 

Methyl-MAPS  Analysis  of  test  samples  and  replicate  #1 

DNAs  have  been  received  and  we  are  currently  constructing  these  with  our  improved  Methyl-MAPS  protocols 
(described  in  last  year’s  report). 

Computational  Pipeline  Improvements 

We  have  developed  the  WIMSi  (Washington  University  Methylation  Signatures)  pipeline  to  enable  us  to  use 
genome-wide  methylation  and  expression  data  to  examine  the  relationship  between  DNA  methylation  and 
expression.  An  initial  description  of  the  method  and  results  can  be  found  in  last  years  report,  while  a  full 
description  is  in  VanderKraats  et  al,  20131  attached  in  Appendix. 

In  brief,  current  computational  tools  focus  on  methods  to  compute  accurate  methylation  levels  from  the  data, 
methods  to  determine  differentially  methylated  regions,  and  visualization  tools,  such  as  genome  browsers. 
These  approaches  have  elucidated  the  genomic  organization  of  these  marks,  but  they  do  not  sufficiently  address 
how  changes  at  individual  loci  potentially  affect  function.  Correspondingly,  such  methods  find  only  a  modest 
negative  correlation  between  differential  DNA  methylation  at  promoters  and  expression.  While  it  is  possible 
that  DNA  methylation  is  not  a  strong  modifier  of  expression,  it  is  more  likely  that  our  computational  tools  are 
insufficient.  We  hypothesize  that  stronger  associations  are  not  observed  because  existing  analysis  methods 
oversimplify  their  representation  of  the  data  such  that  they  do  not  capture  the  diversity  of  existing  methylation 
patterns.  This  includes  changes  at  the  CpG  island  at  a  gene’s  TSS,  changes  at  CpG  island  shores  and  the 
formation  of  long  partially-hypomethylated  domains.  In  addition,  there  are  not  methods  to  systematically  search 
for  new  patterns.  We  have  developed  a  new  set  of  tools  for  discovering  differential  methylation  patterns 
associated  with  expression  change  using  genome-wide  high-resolution  methylation  data:  we  represent 
differential  methylation  as  an  interpolated  curve,  or  signature,  and  then  identify  groups  of  genes  with  similarly- 
shaped  signatures  and  corresponding  expression  changes.  Our  methods  uncover  a  diverse  set  of  DNA 
methylation  patterns  that  are  conserved  across  embryonic  stem  cell  and  cancer  datasets.  We  find  patterns 
consistent  with  a  methylation  change  at  CpG  island  shores  that  correlate  with  expression  change  when  they  are 
to  the  3’  of  the  TSS.  However,  we  find  no  evidence  for  patterns  that  consist  of  changes  in  island  shores  5’  of  the 
TSS  that  correlate  with  transcription.  Overall,  we  find  strong  associations  between  these  methylation  patterns 
and  expression.  We  further  show  that  an  extension  of  our  method  also  outperforms  current  approaches  by 
generating  a  longer  list  of  genes  with  higher  quality  associations  between  differential  methylation  and 
expression.  Source  code  from  the  method  is  publically  available  for  download  at: 
sourceforge.net/projects/wimsi.  We  have  made  some  further  refinements  since  this  publication  including  the 
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Figure  1  Comparison  of  the  quality  of  gene  lists  generated  using  two  versions  of  our  approach  (WIMSi-based 
kNN)  with  those  from  optimized  DMR-based  methods  for  (left)  HMEC-HCC 1954  WGBS,  (center)  MCF7-T47D 
Methyl-MAPS  (right)  IMR90-H1  WGBS  data.  Our  method  was  run  using  a  distance  metric  based  on  either 
Dynamic  Time  Warping  (DTW)  or  Frechet  Distance  (FD).  The  plot  shows  the  trade-off  between  the  number  of 
genes  predicted  to  have  differential  expression  based  on  their  methylation  (x-axis)  and  the  quality  of  the 
predictions  (y-axis)  using  different  parameters  for  each  approach.  Points  up  and  to  the  right  indicate  better 
performance;  50%  quality  is  equivalent  to  random  guessing.  Only  optimal  parameter  choices  with  an  inverse 
correlation  between  methylation  and  expression  are  shown  for  DMR-based  approaches.  WGBS  =  Whole-genome 

addition  of  multiple  distance  metrics  including  Dynamic  Time  Warping. 

Gene  List  Generation 

Enumerating  a  list  of  genes  for  which  expression  and  methylation  changes  are  potentially  linked  is  a  primary 
interest  of  any  genome-wide  methylation  profiling  experiment.  The  WIMSi  method  outlined  above  is  tuned  to 
discover  patterns,  and  thus  produces  a  conservative  gene  list  potentially  prone  to  false  positives.  Our  initial 
publication  describes  one  way  to  use  this  method  to  generate  a  gene  list,  however,  we  have  further  refined  our 
approach  using  a  set  of  supervised  machine  learning  methods. 

At  one  level,  the  basic  principal  behind  WIMSi  is  to  define  the  distance  between  any  two  methylation  signatures 
in  terms  of  shape  similarity  or  distance  between  curves.  Once  these  distances  are  defined,  we  use  a  k-nearest 
neighbors  based  algorithm  in  which  the  expression  labels  for  one  set  of  genes  is  known  and  the  expression  of  a 
new  gene  is  predicted  based  on  the  similarity  of  it’s  methylation  pattern.  For  example  if  a  new  gene  has  a 
methylation  signature  similar  to  10  (or  k  =10)  other  genes  whose  expression  is  silenced  then  likely  the  new  gene 
is  also  silenced.  Our  initial  results  have  shown  that  this  is  a  powerful  and  stable  method  for  using  methylation 
patterns  to  predict  expression.  This  method  works  much  better  than  classical  DMR  algorithms  at  predicting 
expression  (Fig.  1)  and  thus  at  finding  associations  between  methylation  and  expression.  Our  initial  results  also 
show  this  method  to  be  comparatively  robust  to  choices  of  parameters  across  datasets.  For  instance,  in  our 
method  if  we  train  a  predictive  model  on  one  dataset  we  can  use  this  model  to  accurately  predict  expression 
changes  in  another.  Whereas  it  appears  these  parameters  must  be  re-optimized  for  every  dataset  with  traditional 
DMR-based  methods2. 

Our  findings  suggest  that  the  role  of  DNA  methylation  cannot  be  fully  described  by  simply  characterizing  every 
gene  as  “methylated”  or  “unmethylated”.  Using  our  new  method,  we  have  found  and  described  a  variety  of 
methylation  patterns  that  correlate  with  expression  change.  The  true  power  of  this  method  is  in  its  ability  to 
discover  and  separate  distinct  patterns  without  a  priori  knowledge  about  existing  correlations,  which  cannot  be 
accomplished  with  contemporary  approaches.  This  allows  us  to  realize  the  full  potential  of  unbiased  genome¬ 
wide  profiling  of  DNA  methylation  to  reveal  previously  unknown  information  about  methylation’ s  functional 
role.  This  ability  will  be  especially  important  when  examining  regulatory  elements  within  retroelements  as  will 
be  performed  in  this  work. 
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One  additional  implication  of  these  results  also  becomes  clear.  The  simplified  models  used  in  prior  approaches 
at  best  produce  weak  correlations.  However,  if  one  considers  a  more  formal  description  of  the  underlying 
patterns  of  methylation  changes,  methylation  and  expression  data  are  highly  correlated.  This  tool  was  designed 
to  start  from  a  list  of  expression  data,  corresponding  transcription  start  sites  (TSSs)  and  high-resolution  genome¬ 
wide  methylation  data  such  as  from  Methy  1-MAPS.  Thus  it  fits  perfectly  into  the  framework  of  this  proposal 
where  we  will  have  RNA-seq  data  for  expression,  CAGE  data  to  mark  the  TSSs  and  Methyl-MAPS  methylation 
data  for  each  sample.  The  adaptations  to  accommodate  these  datasets  to  address  the  regulation  of 
retrotransposons  are  straightforward  and  we  have  performed  some  initial  analyses  of  LTR-driven  IncRNAs  as  a 
successful  proof-of-concept.  We  have  an  established  pipeline  in  place  and  ready  for  the  data  as  it  is  produced  in 
the  upcoming  year. 

Key  Research  Accomplishments 

•  Developed  new  computational  tool  to  combine  genome-wide  expression  and  methylation  data  to  output 
a  list  of  genes  where  methylation  likely  contributes  to  their  silencing  or  activation. 

Reportable  Outcomes 

Manuscripts 

VanderKraats  ND,  Hiken  JF,  Decker  KF,  Edwards  JR.  (2013)  “Discovering  high-resolution  patterns  of 
differential  DNA  methylation  that  correlate  with  gene  expression  changes.”  Nucleic  Acids  Res  41:6816-27. 

Abstracts 

VanderKraats  ND,  Hiken  JF,  Decker  KF,  Edwards  JR.  (2013)  “Identification  of  tissue-  and  cancer-specific 
genes  with  coordinate  methylation  and  gene  expression  changes.”  Epigenetics  &  Chromatin  Gordon  Research 
Conference,  Aug.  2-9. 

Conclusions 

The  streamlined  Methyl-MAPS  protocol  and  the  computational  analysis  pipeline  we  have  now  established  will 
be  invaluable  for  pushing  the  project  ahead.  The  primary  task  for  my  lab  was  to  provide  Methyl-MAPS 
genome-wide  methylation  profiling  for  tumor  DNA  from  the  uniparous  and  triparous  female  mice  from  Dr. 
Peaston’s  mouse  model.  The  tools  we  have  in  place  will  make  this  easy  to  accomplish  in  the  upcoming  year. 
The  computational  tools  we  have  developed  are  designed  to  work  with  annotated  genes  as  we  have  outlined,  but 
can  also  be  expanded  to  any  transcriptional  unit  with  a  known  TSS  and  known  expression  value.  We  will  thus 
be  able  to  integrate  each  of  the  datasets  generated  in  this  project  (CAGE,  RNA-seq  and  Methyl-MAPS)  to 
address  the  hypothesis  that,  in  preclinical  stages  of  tumor  development,  genomic  demethylation  leads  to 
increased  transcriptional  activity  of  retrotransposons  and  this,  in  turn,  leads  to  transcription  of  otherwise  silent 
genes,  potentially  setting  up  molecular  conditions  that  favor  cancer  development. 


References 

1  Vanderkraats,  N.  D.,  Hiken,  J.  F.,  Decker,  K.  F.  &  Edwards,  J.  R.  Discovering  high-resolution  patterns 
of  differential  DNA  methylation  that  correlate  with  gene  expression  changes.  Nucleic  Acids  Res  41, 
6816-6827,  doi:  10.1 093/nar/gkt482  (2013). 

2  Hansen,  K.  D.,  Fangmead,  B.  &  Irizarry,  R.  A.  BSmooth:  from  whole  genome  bisulfite  sequencing  reads 
to  differentially  methylated  regions.  Genome  biology  13,  R83,  doi:  10.1 186/gb-20 12- 13-1 0-r83  (2012). 


Appendices 

Attached  Vanderkraats  el  al  2013  publication. 


7 


6816-6827  Nucleic  Acids  Research,  2013,  Vol.  41,  No.  14 
doi:10.1093/nar/gkt482 


Published  online  7  June  2013 


Discovering  high-resolution  patterns  of  differential 
DNA  methylation  that  correlate  with  gene 
expression  changes 

Nathan  D.  VanderKraats,  Jeffrey  F.  Hiken,  Keith  F.  Decker  and  John  R.  Edwards* 

Center  for  Pharmacogenomics,  Department  of  Medicine,  Washington  University  School  of  Medicine,  660  S. 
Euclid  Ave,  Campus  Box  8220,  St.  Louis,  MO  63110,  USA 

Received  March  22,  2013;  Revised  April  25,  2013;  Accepted  May  10,  2013 


ABSTRACT 

Methylation  of  the  CpG-rich  region  (CpG  island) 
overlapping  a  gene’s  promoter  is  a  generally 
accepted  mechanism  for  silencing  expression. 
While  recent  technological  advances  have  enabled 
measurement  of  DNA  methylation  and  expression 
changes  genome-wide,  only  modest  correlations 
between  differential  methylation  at  gene  promoters 
and  expression  have  been  found.  We  hypothesize 
that  stronger  associations  are  not  observed 
because  existing  analysis  methods  oversimplify  their 
representation  of  the  data  and  do  not  capture  the  di¬ 
versity  of  existing  methylation  patterns.  Recently, 
other  patterns  such  as  CpG  island  shore  methylation 
and  long  partially  hypomethylated  domains  have  also 
been  linked  with  gene  silencing.  Here,  we  detail  a  new 
approach  for  discovering  differential  methylation 
patterns  associated  with  expression  change  using 
genome-wide  high-resolution  methylation  data:  we 
represent  differential  methylation  as  an  interpolated 
curve,  or  signature,  and  then  identify  groups  of  genes 
with  similarly  shaped  signatures  and  corresponding 
expression  changes.  Our  technique  uncovers  a 
diverse  set  of  patterns  that  are  conserved  across  em¬ 
bryonic  stem  cell  and  cancer  data  sets.  Overall,  we 
find  strong  associations  between  these  methylation 
patterns  and  expression.  We  further  show  that  an 
extension  of  our  method  also  outperforms  other 
approaches  by  generating  a  longer  list  of  genes 
with  higher  quality  associations  between  differential 
methylation  and  expression. 

INTRODUCTION 

DNA  methylation  is  an  important  factor  in  transcrip¬ 
tional  regulation,  playing  a  role  in  genomic  imprinting, 
X-inactivation,  retrotransposon  silencing  and  the  control 


of  tissue-specific  genes  during  differentiation  (1).  DNA 
methylation  patterns  are  frequently  altered  in  tumors 

(2) ,  and  there  is  great  interest  in  understanding  how 
changes  to  these  patterns  contribute  to  human  disease 

(3) .  Even  so,  how  alterations  to  DNA  methylation  affect 
gene  transcription  remains  poorly  characterized.  Over 
60%  of  genes  have  a  CpG-rich  region,  termed  a  CpG 
island,  overlapping  their  promoter  (4).  Classically,  it  is 
thought  that  hypermethylation  of  promoter-associated 
CpG  islands  silences  transcription.  However,  it  was 
recently  shown  that  cancer-  and  tissue-specific  methyla¬ 
tion  variation  in  adjacent  regions,  termed  CpG  island 
shores,  is  also  associated  with  gene  expression  change 

(5) .  Additionally,  genes  are  more  likely  to  be  repressed 
when  they  are  located  in  partially  methylated  domains 

(6)  or  long  hypomethylated  domains  (7,8)  in  cancer. 

Techniques  such  as  whole-genome  bisulfite  sequencing 

(WGBS)  (9)  and  Methyl-MAPS  (10)  have  recently  been 
developed  to  map  methylation  at  single-base  resolution 
genome-wide.  Methods  to  interpret  this  data,  however, 
are  lacking.  Current  computational  techniques  are 
mostly  concerned  with  the  visualization  of  genome-level 
correlations  between  DNA  methylation  and  other  epigen¬ 
etic  marks,  or  with  the  identification  of  regions  that  are 
differentially  marked  between  samples  [recently  reviewed 
in  (11)].  These  tools  have  elucidated  the  genomic  organ¬ 
ization  of  these  marks,  but  they  do  not  sufficiently  address 
how  changes  at  individual  loci  associate  with  and  poten¬ 
tially  affect  function. 

The  most  common  approach  for  characterizing  methy¬ 
lation  changes  between  two  samples  uses  a  sliding  window 
to  identify  differentially  methylated  regions  (DMRs)  (7,9). 
A  gene  with  a  hypermethylated  DMR  near  its  promoter  is 
assumed  to  exhibit  a  decrease  in  expression,  while  a  gene 
near  a  hypomethylated  DMR  should  exhibit  an  increase  in 
expression.  In  practice,  the  Pearson  correlation  coefficient 
between  the  methylation  level  of  the  DMR  and  the  expres¬ 
sion  of  its  associated  gene  is  around  —0.3  (11,12).  It  has 
been  assumed  that  better  anticorrelation  is  precluded 
owing  to  noise  from  experimental  error,  mixed  cellular 
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populations,  copy  number  variations,  chromatin  modifiers 
or  other  regulation  events.  Another  explanation,  however, 
is  that  contemporary  analysis  methods  are  not 
sophisticated  enough  to  recognize  relationships  involving 
more  complex  methylation  patterns.  Existing  approaches 
summarize  their  representation  of  methylation  change  in 
the  promoter  region  to  simplify  analysis,  but  this  sacrifices 
potentially  important  spatial  information  contained  in  the 
locations  of  the  constituent  sites. 

To  discover  DNA  methylation  changes  that  associate 
with  gene  expression  changes,  we  propose  a  new  method 
that  uses  the  entire  differential  methylation  profile  in  the 
vicinity  of  a  gene’s  promoter  (Figure  1).  We  represent  the 
differential  methylation  for  a  fixed  area  around  each 
gene’s  transcription  start  site  (TSS)  as  a  continuous 
curve,  or  signature,  capturing  the  shape  of  the  methylation 
changes.  We  then  apply  a  curve  similarity  metric,  the 
discrete  Frechet  distance,  to  compare  differential  methy¬ 
lation  signatures  for  all  genes.  Using  an  unsupervised  clus¬ 
tering  technique,  we  arrange  the  signatures  according  to 
their  shapes  and  identify  which  clusters  of  genes  exhibit 
statistically  significant  changes  in  expression.  Generalized 
patterns  of  differential  methylation  can  be  extrapolated 
from  the  resulting  clusters.  Because  the  approach  is  un¬ 
supervised,  no  assumptions  need  to  be  made  about  the 
direction  of  a  correlation  between  methylation  and  expres¬ 
sion.  Although  designed  for  pattern  discovery,  the 
method  is  easily  extended  to  identify  a  list  of  genes  poten¬ 
tially  regulated  by  methylation.  These  gene  lists  are  of 
markedly  greater  length  and  have  higher  quality  associ¬ 
ations  between  differential  methylation  and  expression 
than  those  generated  by  existing  methods.  A  current 
implementation  of  the  approach  can  be  found  on  the 
project  website  at  http://epigenomics.wustl.edu/WIMSi/. 

MATERIALS  AND  METHODS 

Methylation  signatures 

Using  WGBS  or  Methyl-MAPS  data  (single-base  reso¬ 
lution)  to  compare  methylation  patterns  between  different 
genes’  promoter  regions  is  complicated  by  the  variability 
in  the  locations  of  their  CpGs.  To  enable  comparisons 
between  genes,  we  standardize  the  representation  of  dif¬ 
ferential  methylation  data  for  each  gene  by  creating  a 
methylation  signature  across  a  fixed-width  region 
centered  at  the  TSS  (Figure  1).  In  a  single  sample,  the 
methylation  level  at  each  probed  CpG  is  represented  as 
a  continuous  value  between  0  and  1,  denoting  fully 
unmethylated  and  methylated,  respectively.  To  compare 
methylation  between  two  samples,  we  subtract  these 
values  at  each  site  to  produce  a  differential  methylation 
score  that  ranges  from  —1  to  1,  denoting  complete 
hypomethylation  and  hypermethylation,  respectively. 

We  create  a  gene’s  methylation  signature  on  a  fixed- 
width  target  region  by  interpolating  the  differential  methy¬ 
lation  scores  using  piecewise  cubic  Hermite  interpolation 
(Figure  IB  and  C)  across  all  CpG  sites  within  the  region 
and  between  one  and  five  CpG  sites  flanking  the  region  on 
either  side.  Because  methylation  is  highly  correlated  over 
short  distances  (Supplementary  Figure  SI)  (13), 


interpolation  provides  a  suitable  estimate  for  differential 
methylation  in  areas  with  missing  data.  Such  missing  data 
points  can  occur  owing  to  insufficient  coverage  or  experi¬ 
mental  limitations  (e.g.  data  collected  only  at  specific  re¬ 
striction  sites).  Interpolated  values  always  lie  between  —1 
and  1.  Regions  with  fewer  than  five  CpG  sites  with  suffi¬ 
cient  coverage  and  regions  with  fewer  than  one  flanking 
site  per  side  are  discarded.  We  also  discard  regions  con¬ 
taining  no  CpG  sites  with  absolute  differential  methyla¬ 
tion  >0.2.  Lastly,  we  apply  Gaussian  smoothing  on  the 
curves  (a  =  50  bp)  to  help  moderate  noise  due  to  experi¬ 
mental  artifacts  such  as  missing  or  inaccurate  measure¬ 
ments.  One  advantage  of  using  a  combination  of 
interpolation  and  smoothing  is  that  it  improves  perform¬ 
ance  of  the  method  for  low  coverage  data.  Previously  it 
was  shown  that  statistical  smoothing  was  an  effective 
method  to  analyze  low  coverage  WGBS  data  (14).  The 
resulting  methylation  signatures  for  each  gene  are 
bounded  between  —1  and  1  on  a  fixed  region  relative  to 
the  TSS. 

Determining  clusters  of  methylation  signatures  with 
significant  expression  changes 

We  compare  methylation  signatures  between  genes  using 
the  discrete  Frechet  distance,  also  known  as  the  coupling 
distance  (15,16).  The  Frechet  distance  is  informally  known 
as  the  dog-man  distance  because  it  represents  the 
minimum  length  of  leash  necessary  for  a  person  traversing 
one  curve  to  walk  a  dog  along  another,  assuming  neither 
party  is  allowed  to  walk  backwards.  The  Frechet  metric  is 
advantageous  because  it  is  efficiently  computable,  while 
still  taking  into  account  the  entire  course  of  the  curves. 
In  particular,  it  appeals  to  an  intuitive  notion  of  similarity 
between  methylation  signatures  in  that  two  curves  with 
similar  shape  will  have  a  low  distance,  even  if  one  curve 
is  shifted  slightly  from  the  other  relative  to  the  TSS 
(Supplementary  Figure  S2).  Because  Frechet  distance  is 
calculated  in  Euclidean  space,  a  scaling  parameter  must 
specify  the  relationship  between  the  x-axes  of  the  curves, 
in  bp,  and  the  y-axes,  in  differential  methylation  level. 
Intuitively,  this  parameter  controls  how  far  peaks  in  the 
y-direction  are  allowed  to  slide  across  the  x-axis  and  still 
be  identified  as  similar  between  two  genes. 

Formally,  a  methylation  signature,  or  curve, 
can  be  described  as  a  continuous  mapping 
f :  [0,1]  — >  [0,nb]  x  [—1,1]  where  is  the  fixed  length  of 
the  region  in  base  pairs.  For  two  curves  A  and  B,  the 
Frechet  distance  is  defined  as  follows: 

&Frechet{A,B)  =  inf  ma  xd(A(a(t)),B(P(t))) 

a,p  t 

where  te[ 0,1],  d(x,y )  is  the  Euclidean  distance  between  x 
and  y,  and  a(t)  and  f3(t )  are  continuous,  monotonically 
increasing  functions  from  [0,1]  to  [0,1]  such  that 
a(0)  =  /3( 0)  =  0  and  a(l)  =  /3(\)  =  1.  For  any  two  differ¬ 
ential  methylation  signatures,  we  calculate  similarity 
using  the  coupling  distance,  which  can  be  computed  on 
polygonal  curves  in  quadratic  time  using  a  dynamic 
programming  algorithm  (16).  Each  continuous 
interpolated  curve  is  converted  into  a  polygonal  curve 
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Figure  1.  Method  overview  and  example  methylation  signatures.  (A)  Overview  of  the  approach  for  associating  spatially  similar  DNA  methylation 
changes  with  corresponding  changes  in  transcription.  (B,  C)  Example  methylation  signatures  from  HMEC-HCC1954  WGBS  data  for  the  tumor 
suppressor  gene  CDH4  and  the  LPCAT3  gene.  Conversion  of  methylation  data  to  signatures  allows  direct  comparison  of  the  data  between  genes 
with  different  distributions  of  CpG  sites  relative  to  their  TSSs.  Top  panels  show  methylation  (blue  is  unmethylated  and  red  is  methylated)  and  RNA- 
Seq  expression  data  on  the  UCSC  genome  browser.  Bottom  panels  show  interpolated  and  smoothed  methylation  signatures  (black  curve)  that  are 
used  to  calculate  the  discrete  Frechet  distance.  Blue  tick  marks  show  locations  of  all  CpG  sites.  Black  dots  mark  experimentally  measured  differences 
in  methylation  between  the  two  samples. 


by  averaging  every  10  bp,  yielding  ceiling  1 10)  vertices  for 
the  nb  bases  in  the  target  region.  For  sparsely  sampled 
areas,  this  fixed  resolution  greatly  limits  the  discretization 
error  between  our  discrete  Frechet  distance  and  the  con¬ 
tinuous  version  because  this  error  is  bounded  by  the 
maximum  edge  length  for  each  pair  of  curves  (17). 
Averaging  every  10  bp  allows  the  algorithm  to  run  faster 
than  sampling  every  2  bp.  Because  the  averaging  interval  is 
considerably  smaller  than  the  width  of  the  Gaussian 
smoothing  kernel,  it  has  little  effect  on  the  resulting  dis¬ 
tances.  We  compared  the  results  of  clustering  for  several 
experiments  using  10  bp  averaging  and  dinucleotide 
sampling,  and  found  no  differences  in  the  type  and 
number  of  patterns  found  for  the  HMEC-HCC1954 
data  set.  By  default,  scaling  between  the  x-  and  y-axes 
was  set  such  that  2500  bp  along  the  x-direction  was 
equivalent  to  one  unit  of  differential  methylation  in  the 
y-direction.  We  tested  a  large  number  of  values  for  this 
ratio.  Manual  inspection  of  resulting  clusterings  showed 
that  the  final  set  of  patterns  discovered  was  not  sub¬ 
stantially  altered  despite  moderate  changes  to  the  scaling 
ratio. 

Methylation  signatures  are  arranged  using  unsupervised 
complete-linkage  agglomerative  hierarchical  clustering 
based  solely  on  the  discrete  Frechet  distance  between 


curve  pairs.  After  clustering,  we  introduce  the  differential 
expression  data  to  identify  clusters  of  genes  with  similar 
expression  differences.  Expression  data  can  be  obtained 
through  any  method,  including  RNA-Seq  and  expression 
array  analysis.  To  focus  on  genes  for  which  differential 
methylation  and  expression  could  be  related,  we  consider 
only  genes  with  greater  than  2-fold  expression  change. 

We  identify  clusters  from  the  dendrogram  where  methy¬ 
lation  is  significantly  associated  with  expression  as 
follows.  For  each  cluster,  we  evaluate  the  likelihood  that 
the  observed  group  of  expression  values  is  atypical 
compared  with  the  distribution  of  expression  values  over 
the  entire  training  set  [a  dendrogram  on  ns  signatures 
contains  exactly  (ns- 1)  clusters].  We  determine  significance 
using  a  two-sample  Kolmogorov-Smirnov  test  between 
the  set  of  differential  expression  values  in  each  cluster 
and  the  entire  population  of  expression  values.  The  false 
discovery  rate  is  controlled  at  0.05  using  the  Benjamini- 
Hochberg  procedure.  This  estimate  of  significance  is  con¬ 
servative  because  one  sample  is  a  subset  of  the  other. 

After  all  statistically  significant  clusters  are  identified, 
we  select  a  set  of  nonoverlapping  clusters  using  an  iterative 
algorithm  to  define  the  trade-off  between  grouping  similar 
patterns  together  versus  breaking  them  into  distinct 
clusters.  We  seek  a  balance  between  the  selection  of 
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larger  clusters  that  embody  more  general  patterns  and  the 
conformity  between  differential  expression  values,  called 
purity.  The  purity  of  a  set  of  genes  is  defined  as  the 
fraction  of  genes  that  have  expression  change  in  the 
same  direction  as  the  majority.  The  full  process  is 
described  in  the  Supplementary  Methods.  We  used  a 
minimum  purity  of  0.85  unless  otherwise  stated. 

Generating  a  gene  list 

To  produce  a  list  of  genes  with  associated  differential 
methylation  and  expression,  we  ran  the  discovery  method 
on  a  set  of  overlapping  5  kb  regions  centered  at  a  fixed  set 
of  locations  around  the  TSS  (Figure  6A).  The  scaling  factor 
between  the  x-  and  y-dimensions,  minimum  cluster  purity 
and  all  interpolation  parameters  were  the  same  as  previ¬ 
ously  described.  We  used  22  regions,  spanning  an  area 
from  [-25  kb,  25  kb]  relative  to  the  TSS.  The  area 
[—10  kb,  10  kb]  relative  to  the  TSS  was  covered  more 
densely  because  this  was  where  the  majority  of  relevant 
differential  methylation  features  were  observed  by  the 
pattern  discovery  tool.  The  leftmost  boundaries  (in  kb, 
relative  to  the  TSS)  were:  —25,  —20,  —15,  —10,  —9,  —8, 
-7,  -6,  -5,  -4,  -3,  -2,  -1,  0,  1,  2,  3,  4,  5,  10,  15  and  20. 
For  each  region,  we  recorded  the  set  of  genes  identified  as 
positives  (i.e.  changing  in  the  same  direction  as  the  majority 
of  their  cluster).  Genes  that  are  identified  for  at  least  m 
regions  are  added  to  the  final  list.  Unless  otherwise 
stated,  an  m  of  two  regions  was  used. 

RESULTS 

Pattern  discovery  in  high-resolution  methylation  data 

We  evaluated  our  technique  using  three  primary  and  nine 
additional  comparisons  (17  total  data  sets)  with  high-reso¬ 
lution  methylation  and  RNA-Seq  expression  data 
(Supplementary  Table  SI).  The  first  primary  data  set  was 
WGBS  of  nontumorigenic  human  mammary  epithelial  cells 
(HMEC)  and  breast  cancer  cells  (HCC1954)  (6)  containing 
methylation  data  for  84.7%  of  genomic  CpGs  with  a 
coverage  level  of  at  least  10  in  each  sample 
(Supplementary  Table  SI,  GEO:  GSE29127).  We  focused 
many  of  the  analyses  in  this  article  on  this  HMEC- 
HCC1954  data  set  because  it  has  high  coverage  and 
contains  examples  of  all  the  methylation  patterns  we  dis¬ 
covered  (Supplementary  Data  1).  To  examine  how  the 
method  performed  on  a  lower  coverage  data  set,  we 
examined  WGBS  data  for  HI  embryonic  stem  (ES)  cells 
and  IMR90  fetal  lung  fibroblasts  (9)  (GEO:  GSE16256). 
While  the  genomic  coverage  level  of  this  data  was  high, 
the  data  was  sparsely  sampled  at  promoters:  <40%  of 
CpGs  had  coverage  of  at  least  10  (Supplementary  Figure 
S3).  By  including  all  CpGs  with  coverage  as  low  as  a  single 
read,  the  data  covered  93.5%  of  genomic  CpGs.  However, 
methylation  scores  are  expected  to  be  less  accurate  in 
regions  with  lower  sampling.  WGBS  data  was  processed 
and  methylation  scores  computed  as  in  (6).  Analysis  was 
limited  to  only  CpG  methylation  (complete  results  are  in 
Supplementary  Data  2).  Lastly,  to  validate  our  findings 
using  data  from  an  alternative  experimental  method,  we 
generated  Methyl-MAPS  data  from  MCF7  and  T47D 


breast  cancer  cells.  Methyl-MAPS  uses  methylation-sensi- 
tive  and  -dependent  restriction  enzyme  digests  followed  by 
high-throughput  sequencing  to  identify  methylation  levels 
at  individual  CpGs  (10).  Libraries  were  constructed, 
sequenced  and  analyzed  as  in  (10)  (see  Supplementary 
Methods  for  further  details).  We  limited  our  analysis  to 
sites  interrogated  by  both  digests,  which  included  24.9% 
of  genomic  CpGs  with  coverage  of  at  least  five  to  ensure 
an  adequate  number  of  CpGs  with  data  around  each 
promoter.  Expression  data  for  each  sample  came  from 
poly(A)  selected  RNA-Seq  experiments  (see 
Supplementary  Methods  for  further  details).  All  Methyl- 
MAPS  and  RNA-Seq  data  are  available  from  GEO,  acces¬ 
sion  GSE45337.  Complete  results  are  in  Supplementary 
Data  3. 

In  addition  to  these  three  primary  comparisons,  we 
applied  our  method  to  WGBS  and  RNA-Seq  data 
comparing  H9  ES  cells  (18,19)  to  IMR90  cells,  HI  cells 
to  four  HI -derived  differentiated  cell  types,  female 
adipose-derived  stem  cells  (ADS)  to  ADS-derived  adipo¬ 
cytes  and  ADS-derived  induced  pluripotent  stem  cells 
(ADS-iPSCs)  (19)  and  primary  mouse  ES  cells  to 
isolated  sperm  and  oocytes  (20). 

We  selected  an  initial  region  for  the  discovery  of  differ¬ 
ential  methylation  patterns  that  associate  with  expression 
changes  based  on  two  criteria.  First,  average  CpG  density 
increases  roughly  2  kb  upstream  and  downstream  of  the 
TSS,  suggesting  that  sites  in  this  region  may  have  regula¬ 
tory  importance  (21).  Second,  increased  variability  of  dif¬ 
ferential  methylation  in  CpG  island  shores,  defined  as  the 
regions  up  to  2  kb  away  from  a  CpG  island,  has  been 
linked  to  differential  expression  (5).  To  search  for 
patterns  across  this  entire  area,  we  chose  a  conservative 
initial  region  of  10  kb  centered  on  the  TSS.  Using  a  cluster 
purity  threshold  of  0.85,  we  identified  27  clusters  that  were 
significantly  correlated  with  differential  expression,  con¬ 
taining  519  genes,  in  the  HMEC-HCC1954  data  set 
(Figure  2;  complete  clustering  results  are  in 
Supplementary  Data  1).  A  cartoon  depiction  of  each  of 
the  patterns  observed  is  shown  in  Figure  3.  Applying  our 
method  to  the  H1-IMR90  and  MCF7-T47D  Methyl- 
MAPS  data  sets  showed  that  our  method  could  still 
identify  clusters  corresponding  to  each  of  the  patterns  dis¬ 
covered  in  the  higher  quality  HMEC-HCC1954  data  set 
even  with  low  promoter  coverage  or  substantially  reduced 
sampling  (Supplementary  Data  2  and  3).  It  is  likely  that 
interpolation  and  Gaussian  smoothing  are  helpful  for 
analyzing  low  coverage  data.  Limiting  HMEC-HCC1954 
data  to  only  sites  probed  by  Methyl-MAPS  showed  the 
data  reduction  had  no  impact  on  the  ability  to  detect  each 
of  the  identified  patterns.  As  a  negative  control,  we 
randomly  scrambled  the  expression  values  for  all  genes 
in  each  data  set;  any  cluster  identified  as  significant  was 
a  false  positive.  For  1000  random  permutations,  our  tech¬ 
nique  identified  a  false-positive  cluster  in  1.7-2. 3%  of  the 
experiments  (Supplementary  Table  S2). 

Patterns  overlapping  the  TSS 

From  the  resulting  sets  of  significant  clusters,  we  sought  to 
characterize  the  common  features  of  the  methylation 
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Figure  2.  Example  of  clustering  methylation  signatures  from  HMEC- 
HCC1954  WGBS  data.  (A)  Complete  dendrogram  for  clustering  3566 
methylation  signatures.  Clusters  highlighted  in  orange,  magenta  and 
cyan  indicate  significant  clusters  with  purity  of  at  least  0.75,  0.85  and 
0.95,  respectively.  Subclusters  featured  in  (B,  C)  are  indicated  with 
arrows.  A  heat  map  of  expression  data  is  plotted  in  bars  alongside 
the  dendrogram.  F.C.  denotes  expression  fold  change;  green  indicates 
downregulation,  red  indicates  upregulation.  (B,  C)  Left  side  shows  the 
complete  cluster  with  boxes  to  indicate  expression.  On  the  right  side, 
the  top  panels  are  cartoons  depicting  the  relevant  pattern;  the  bottom 
panels  show  example  signatures  from  subclusters  of  a  significant 
HMEC-HCC1954  cluster.  Clustering  was  performed  on  10  kb  regions. 

(B)  Subcluster  showing  a  pattern  of  methylation  increase  across  the 
TSS,  similar  to  the  classic  methylation  of  a  CpG  island  across  the 
TSS.  Patterns  from  the  entire  cluster  are  in  Supplementary  Figure  S4. 

(C)  Subcluster  showing  a  pattern  defined  by  decrease  in  methylation  3' 
of  the  TSS,  similar  to  a  methylation  change  at  a  CpG  island  shore.  The 
entire  cluster  is  in  Supplementary  Figure  SI  1. 
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Figure  3.  Summary  of  the  diverse  patterns  found  in  the  data  sets 
analyzed.  One  version  of  each  pattern  is  shown  to  the  left  with  its 
corresponding  inverted  pattern  to  the  right.  (A)  Patterns  that  overlap 
the  TSS  (TSS1,  TSS2  and  LONGO),  and  their  respective  inverted 
versions  (TSSli,  TSS2i,  and  LONGOi).  (B)  Patterns  proximal  to  the 
TSS  all  include  a  region  of  change  downstream  of  the  TSS  and  are 
separated  by  their  upstream  differences.  Shown  are  PROX1,  PROX2, 
PROX3  and  their  respective  inverted  versions:  PROXli,  PROX2i  and 
PROX3i. 


signatures  that  may  be  responsible  for  the  observed  rela¬ 
tionships  with  expression  change.  As  expected,  many 
clusters  contained  patterns  with  a  region  of  strong  hyper- 
or  hypomethylation  spanning  the  TSS  that  negatively 
correlated  with  expression  change  (Figure  2B).  Further  in¬ 
spection  of  HMEC-HCC1954  data  revealed  several  distinct 
patterns  overlapping  the  TSS.  After  rerunning  our  method 
using  methylation  signatures  based  on  a  30  kb  region 
centered  at  the  TSS,  three  distinct  patterns  emerged 
(Figure  3A):  a  hypermethylated  region  at  the  TSS  sur¬ 
rounded  by  long  hypomethylated  domains  (TSS1;  Figure 
4A),  a  hypermethylated  region  at  the  TSS  set  in  a  region 
with  invariant  methylation  levels  (TSS2;  Figure  2B)  and  a 
pattern  of  long  hypomethylated  domains,  but  with  no 
change  in  methylation  at  the  TSS  (LONGO;  Figure  4F). 
The  rarest  pattern,  LONGO,  was  also  observed  in  the  H9- 
IMR90  (Supplementary  Figure  S7)  and  H1-IMR90 
(Supplementary  Data  2)  comparisons.  Hypomethylated 
domains  associated  with  these  patterns  (TSS1,  LONGO) 
extend  up  to  several  Mb  in  both  directions  (Figure  4C). 
In  HMEC-HCC1954  data,  we  found  a  hypomethylated 
region  at  the  TSS  set  in  a  hypermethylated  region  (TSSli; 
Figure  4B).  While  not  all  patterns  were  observed  in  every 
comparison  we  analyzed,  the  overall  set  of  patterns  was 
common  across  all  data  sets.  For  instance,  MCF7-T47D 
and  H1-IMR90  data  sets  show  an  inverted  TSS2  pattern 
(TSS2i;  Supplementary  Data  2  and  3).  IMR90  fibroblasts 
exhibit  a  TSS1  pattern  relative  to  HI  stem  cells 
(Supplementary  Data  2),  although  the  hypomethylated 
domains  in  fibroblasts  are  much  smaller  (Figure  4E).  It 
has  been  suggested  that  the  observed  positive  correlation 
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Figure  4.  Analysis  of  patterns  overlapping  the  TSS.  (A,  B)  Top  panel  is  a  cartoon  depicting  the  relevant  pattern.  Bottom  panel  shows  an  example 
subcluster  from  an  HMEC-HCC1954  cluster.  Clustering  was  performed  on  30  kb  regions.  F.C.  denotes  expression  fold  change;  green  indicates 
downregulation,  red  indicates  upregulation.  (A)  Subcluster  characterized  by  hypermethylation  at  the  TSS  set  in  a  long  hypomethylated  domain 
(TSS1).  The  entire  cluster  is  in  Supplementary  Figure  S6.  (B)  Subcluster  characterized  by  hypomethylation  at  the  TSS  set  in  a  long  hypermethylated 
domain  (TSSli).  The  entire  cluster  is  in  Supplementary  Figure  S8.  (C)  Average  differential  methylation  and  CpG  density  for  genes  from  clusters 
identified  as  exhibiting  three  TSS  patterns  (TSS1,  TSSli  and  TSS2).  Example  signatures  for  each  of  the  three  patterns  are  shown  in  parts  (A),  (B)  and 
Figure  2B.  (D)  Alu  elements  and  RefSeq  genes  are  depleted  in  the  regions  around  genes  with  the  TSS1  pattern.  No  enrichment  or  depletion  of 
other  repeats  was  found  (Supplementary  Figure  S9).  (E)  The  TSS1  pattern  is  also  found  in  the  H1-IMR90  comparison  (Supplementary  Data  2). 
(F)  Example  subcluster  showing  the  LONGO  pattern  from  the  HMEC-HCC1954  comparison.  The  entire  cluster  is  in  Supplementary  Figure  S6. 


between  gene-body  methylation  and  expression  may  be  due 
to  similar  domains  (8).  Inspection  of  individual  genes  with 
the  LONGO  pattern  revealed  that  promoters  were 
hypomethylated  in  both  samples,  leading  us  to  speculate 
that  long  hypomethylated  domains  could  contribute  to 
gene  silencing,  possibly  through  the  recruitment  of 
factors  that  lead  to  the  formation  of  repressive  chromatin 
(6).  Clusters  representing  the  inverse  patterns — a 
hypomethylated  region  at  the  TSS  set  in  either  long 
hypermethylated  or  invariant  regions — were  also 
observed  (TSSli,  TSS2i;  Figure  4B). 

Patterns  proximal  to  the  TSS 

In  addition  to  patterns  with  differential  methylation  across 
the  TSS,  we  identified  multiple  clusters  in  all  data  sets 
characterized  by  a  change  in  methylation  downstream  of 
the  TSS  (Figure  2C,  Figure  5A  and  B).  Analysis  of  all  genes 
in  clusters  associated  with  this  pattern  indicated  that  it  pre¬ 
dominantly  occurs  within  3  kb  of  the  TSS  (Figure  5C).  This 
y  pattern  was  observed  in  several  distinct  clusters  due  to 
variations  in  other  parts  of  the  differential  methylation 
curves  (see  examples  in  Figure  5A  and  B).  The  TSS- 
proximal  patterns  found  by  our  discovery  method  are  in 
agreement  with  the  variation  in  methylation  found  at  CpG 
island  shores.  Interestingly  though,  we  find  no  significant 
association  between  the  proximal  methylation  patterns  and 
whether  promoters  are  classified  as  CpG-rich  or  -poor 
(Supplementary  Figure  S10).  This  may  suggest  that  these 
proximal  methylation  changes  are  not  confined  to  island 


shores,  or  it  may  be  due  to  the  fact  that  the  patterns  we 
discover  are  anchored  to  the  TSS. 

Although  some  clusters  with  3'  patterns  also  displayed 
methylation  change  upstream  of  the  TSS,  no  independent 
relationship  was  observed  between  a  5'  pattern  and  differ¬ 
ential  expression  (Figure  5A  and  B).  Differential  methyla¬ 
tion  downstream  of  the  TSS  was  consistently  linked  with 
expression  change,  regardless  of  upstream  hyper-  or 
hypomethylation.  To  further  probe  whether  distinct  cor¬ 
relative  patterns  occur  upstream  of  the  TSS,  we  ran  our 
method  using  signatures  defined  on  the  region  from  —5  kb 
to  the  TSS.  The  majority  of  identified  clusters  were 
characterized  by  changes  in  methylation  at  the  TSS. 
Genes  in  clusters  with  methylation  changes  5'  of  the  TSS 
often  had  correlative  3'  changes  as  well.  We  found  no 
clusters  in  any  of  the  data  sets  that  supported  a  convincing 
association  between  5'  changes  and  expression.  Examining 
5  kb  regions  shifted  downstream  of  the  TSS  discovered 
more  patterns  than  those  upstream  of  the  TSS,  consistent 
with  the  existence  of  the  3'  pattern  (Figure  5D)  and 
absence  of  a  5'  pattern.  From  the  cumulative  evidence, 
we  speculate  that  CpG  island  shore-like  regions 
upstream  of  the  TSS  may  not  be  independently  associated 
with  expression  changes.  Furthermore,  while  we  observed 
that  differential  methylation  patterns  across  the  TSS  were 
generally  associated  with  gene  silencing,  3'  methylation 
patterns  correlated  more  often  with  downregulation  than 
complete  silencing  (Supplementary  Figure  SI  2).  3' 

hypermethylation  events  have  previously  been  shown  to 
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Figure  5.  Patterns  containing  3'  methylation  changes  are  inversely  correlated  with  expression,  independent  of  changes  57  of  the  TSS.  (A,  B)  Top 
panel  is  a  cartoon  depicting  the  relevant  pattern.  Bottom  panel  shows  an  example  subcluster  from  the  default  HMEC-HCC1954  clustering  on  a  10  kb 
region.  Entire  clusters  for  both  parts  are  displayed  in  Supplementary  Figure  SI  1.  (A)  Subcluster  exhibiting  genes  with  a  decrease  in  methylation  on 
the  3'  side  of  the  TSS  and  an  increase  in  methylation  5'  of  the  TSS  and  with  a  significant  increase  in  expression  (PROX3i).  (B)  Subcluster  exhibiting 
genes  with  an  increase  in  methylation  on  both  3'  and  5;  sides  of  the  TSS  and  with  a  significant  decrease  in  expression  (PROX2).  (C)  Meta-gene 
analysis  of  genes  from  significant  clusters  whose  dominant  pattern  was  identified  as  including  a  37-proximal  component  shows  that  the  3'  pattern  is 
typically  confined  to  within  3  kb  of  the  TSS.  From  top  to  bottom,  the  three  panels  depict  averages  for  HMEC-HCC1954,  MCF7-T47D  and  Hl- 
IMR90  data  sets.  (D)  The  number  of  new  genes  identified  increases  downstream  of  the  TSS,  but  not  upstream.  We  started  by  identifying  genes  in 
significant  clusters  for  a  5  kb  region  centered  at  the  TSS.  We  iteratively  moved  the  5  kb-wide  region  upstream  of  the  TSS  and  identified  all  genes  in 
significant  clusters  not  found  previously.  This  process  was  repeated  for  each  of  the  region  positions  used  by  the  gene  list  tool.  The  entire  process  was 
then  repeated  for  downstream  of  the  TSS. 


affect  expression  (22),  which  supports  the  idea  that  the  3' 
changes  we  observe  could  be  potentially  functional. 

Genomic  features  associated  with  particular  DNA 
methylation  patterns 

Gene  ontology  and  expression  signature  analysis  (23)  of 
the  genes  found  to  have  correlated  differential  methylation 
and  expression  showed  that  there  was  enrichment  for 
cancer-related  genes  in  HMEC-HCC1954  data,  while 
there  was  enrichment  for  genes  associated  with  differenti¬ 
ation  in  H1-IMR90  data.  Long  hypomethylated  and  par¬ 
tially  methylated  domains  in  cancer  cells  were  previously 
observed  to  have  distinct  sequence  contexts  (6,7).  We  find 
that  genes  with  the  TSS1  pattern  are  associated  with  a 
depletion  of  CpG  density,  Alu  elements  and  gene  density 
relative  to  all  gene  promoters  (Figure  4C  and  D).  These 
depletions  are  observed  in  regions  ranging  from  10  kb  up 
to  0.5  Mb  from  the  TSS.  In  summary,  genes  exhibiting  the 
TSS1  pattern  have  the  same  distinct  sequence  properties  as 
genes  found  in  long  hypomethylated  domains  (7). 

There  has  been  significant  interest  in  addressing  whether 
specific  sequences  direct  or  inhibit  methylation  at 


particular  promoters  or  CpG  islands  (5,24-26).  One  pos¬ 
sibility  is  that  different  sequences  may  direct  different 
methylation  patterns.  A  preliminary  search  for  motifs 
associated  with  each  discovered  pattern  did  not  find  a  sig¬ 
nificant  enrichment  for  any  novel  or  known  motifs 
associated  with  one  pattern  more  than  another.  In 
addition,  we  find  no  significant  association  between  each 
of  the  different  observed  methylation  patterns  and 
whether  promoters  are  classified  as  CpG-rich  or  CpG- 
poor  (Supplementary  Figure  S10). 

Quantifying  the  sensitivity  of  pattern  discovery 

Because  the  true  patterns  of  differential  methylation  that 
correlate  with  expression  change  are  unknown,  an  obvious 
question  is  whether  we  have  detected  all  correlative  methy¬ 
lation  patterns  in  the  data  set.  To  address  this  issue  and 
quantify  the  method’s  ability  to  recover  genes  from  known 
correlative  patterns,  we  created  a  model  to  simulate  dif¬ 
ferential  methylation  signatures  and  expression  values.  A 
complete  description  of  the  methods  for  simulated  data  is 
in  the  Supplemental  Methods. 
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Simulated  genes  with  a  predefined  correlation  between 
methylation  and  expression  were  added  into  two  kinds  of 
background  data  sets:  randomly  generated  simulated 
genes  with  no  correlation  with  expression  (Supplementary 
Figure  S13-S15)  and  real  data.  Using  wholly  simulated 
data  we  explored  the  types  of  patterns  our  method  could 
discover.  We  successfully  detected  a  variety  of  introduced 
patterns  including  peaks  of  differential  methylation  with 
fixed  widths  and  locations  relative  to  the  TSS 
(Supplementary  Figure  SI 6),  peaks  with  varying  location 
(Supplementary  Figure  SI 7)  and  peaks  with  different  vari¬ 
ances  in  height  (Supplementary  Figure  SI 8).  More 
complex  patterns  were  more  easily  detected,  presumably 
because  highly  constrained  curves  are  more  likely  to  have 
similar  Frechet  distances. 

We  also  explored  the  method’s  ability  to  discover  new 
patterns  in  a  background  of  real  methylation  data.  We 
introduced  three  simulated  patterns  (Supplementary 
Figure  SI 9)  into  HMEC-HCC1954  WGBS  data.  We 
varied  the  number  of  genes  added,  and  measured  both 
the  ability  to  discover  the  patterns  and  the  number  of 
genes  correctly  identified  (Supplementary  Figure  S20). 
We  found  that  the  patterns  were  easily  discoverable 
when  at  least  50-100  simulated  genes  were  introduced, 
depending  on  the  particular  simulated  pattern. 
Performance  was  impacted  when  one  of  the  simulated 
patterns  was  too  similar  to  the  existing  patterns 
(Supplementary  Figure  S21).  Furthermore,  we  found 
that  the  fraction  of  simulated  genes  required  to  reliably 
detect  a  pattern  decreased  as  the  number  of  genes  in  the 
data  set  increased  (Supplementary  Figure  S22).  This 
suggests  that  one  could  detect  rare  patterns  by  simply 
concatenating  multiple  data  sets. 

Enumerating  genes  with  correlative  methylation  signatures 

Generating  a  list  of  genes  for  which  expression  and  methy¬ 
lation  changes  are  potentially  linked  is  a  primary  interest 
of  any  genome- wide  methylation  profiling  experiment. 
The  method  described  above  is  tuned  to  discover 
patterns,  not  to  create  a  gene  list.  Individual  genes 
within  good  clusters  will  sometimes  be  false  positives, 
and  other  genes  will  be  excluded  because  their  clusters 
do  not  meet  the  high  purity  threshold.  To  produce  a 
better  gene  list,  we  executed  our  method  on  a  set  of 
overlapping  5  kb  regions  centered  at  a  fixed  set  of  loca¬ 
tions  around  the  TSS  (Figure  6A).  Comparison  of  gene 
lists  from  H1-IMR90  replicate  WGBS  data  shows  good 
concordance  between  replicates.  To  determine  the  extent 
to  which  genes  are  incorrectly  included  owing  to  the 
creation  of  errant  clusters,  we  randomly  scrambled  the 
expression  values  in  the  HMEC-HCC1954  dendrogram. 
For  1000  such  experiments  using  default  clustering  par¬ 
ameters,  only  seven  experiments  returned  any  false-posi¬ 
tive  genes:  six  reported  a  single  false  positive  and  a  seventh 
returned  two.  We  also  tested  the  ability  of  the  gene  list 
method  to  recover  introduced  simulated  genes  (see 
‘Quantifying  the  sensitivity  of  pattern  discovery’  section 
above).  For  several  simulated  patterns,  we  found  that  the 
resultant  gene  lists  included  nearly  all  simulated  genes 
when  50-100  were  present  (Supplementary  Figure  S20). 


Comparison  to  other  approaches 

We  compared  the  quality  of  the  gene  lists  produced  by  our 
approach  to  lists  constructed  by  two  commonly  used 
methods.  For  the  DMR  approach,  regions  of  differential 
methylation  are  defined  between  two  samples  using  a 
sliding  window.  DMRs  are  coupled  to  a  particular  gene 
using  a  cutoff  for  the  distance  between  the  DMR  and  the 
gene’s  TSS  (7,9).  For  the  promoter-based  approach,  a  fixed 
window  around  each  gene’s  TSS  defines  the  gene’s 
promoter.  If  methylation  changes  substantially  within 
this  window,  the  gene  is  labeled  as  differentially  methylated 
(10,27).  We  optimized  DMR-  and  promoter-based 
approaches  for  each  data  set  using  69  360  and  6174  param¬ 
eter  choices,  respectively  (Supplementary  Figure  S23  and 
Supplementary  Table  S3),  while  using  a  single  common  set 
of  parameters  for  our  approach  across  all  data  sets. 

Judging  the  quality  of  the  gene  lists  is  difficult  because 
there  is  no  experimental  gold  standard  data  set  for  which 
the  relationship  between  methylation  at  specific  CpG  sites 
and  expression  is  well  known  for  all  genes.  A  correlation 
coefficient  has  often  been  used  to  quantify  the  association 
between  methylation  and  expression.  When  applied  to  a  list 
of  genes  for  which  methylation  is  predicted  to  associate 
with  expression  change,  however,  the  correlation  coefficient 
only  judges  one  part  of  the  method’s  performance.  For 
example,  by  varying  its  parameters,  the  DMR  method 
can  produce  short  gene  lists  with  strong  correlations  or 
long  lists  with  weak  correlations.  To  compare  methods, 
we  directly  examine  the  trade-off  between  the  total 
number  of  differentially  expressed  genes  identified  as  poten¬ 
tially  correlated  versus  the  fraction  of  identified  genes  that 
are  actually  correlated  in  the  predicted  direction  (Figure  6). 
This  trade-off  is  somewhat  analogous  to  comparing  the  rate 
of  total  positives  to  the  rate  of  true  positives,  with  the  true¬ 
negative  and  false-negative  rates  being  unknown.  On  the 
basis  of  these  criteria,  our  approach  clearly  outperforms 
DMR-  and  promoter-based  methods.  For  example, 
consider  the  HMEC-HCC1954  data  (Figure  6B).  When 
the  DMR  method  is  examined  at  a  level  where  it  correctly 
associates  the  direction  of  expression  change  92%  of  the 
time,  it  returns  only  26  genes  (0.7%  of  all  differentially 
expressed  genes).  Our  approach  produces  a  list  of  461 
genes  (13%)  at  a  correct  association  rate  of  95%.  With 
less  restrictive  criteria,  the  DMR  method  gives  a  list  of 
717  genes  (20%)  at  a  correct  association  rate  of  71%. 
Our  technique  gives  a  list  of  roughly  the  same  size  (761 
genes)  at  a  correct  association  rate  of  93%. 

Table  1  presents  results  from  each  of  the  different  data 
sets  we  analyzed  and  includes  data  from  primary  cells  and 
cell  lines.  All  analyses  were  performed  using  the  default 
parameters.  These  results  imply  that  prior  approaches 
have  underestimated  the  strength  of  the  relationship 
between  differential  methylation  and  expression.  With  a 
better  model  of  the  underlying  patterns  of  methylation 
change,  it  is  clear  that  methylation  and  expression  data 
are  highly  associated. 

Gene  lists  for  low  coverage  data  sets 

We  next  sought  to  examine  how  our  approach  performed 
with  suboptimal  data.  Our  technique  returned  similar 
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Figure  6.  Gene  lists  generated  by  our  method  are  of  markedly  greater  length  and  quality  than  those  generated  by  alternative  methods.  (A)  Schematic 
of  our  process  for  generating  gene  lists.  (B-D)  Comparison  of  gene  lists  generated  using  our  approach  with  those  from  optimized  DMR  and 
promoter-based  methods  for  (B)  HMEC-HCC1954  WGBS,  (C)  IMR90-H1  WGBS  and  (D)  MCF7-T47D  Methyl-MAPS  data.  The  plot  shows  the 
trade-off  between  the  number  of  genes  identified  as  being  associated  with  differential  expression  based  on  their  methylation  (x-axis)  and  the  quality  of 
the  associations  (y-axis),  which  is  the  fraction  of  identified  genes  for  which  the  direction  of  expression  change  matches  the  expected  direction  based 
on  methylation.  Points  up  and  to  the  right  indicate  better  performance;  50%  quality  is  equivalent  to  random  guessing.  Only  optimal  parameter 
choices  with  an  inverse  correlation  between  methylation  and  expression  are  shown  for  DMR-  and  promoter-based  approaches  (see  Supplementary 
Figure  S23  for  further  information  about  this  optimization).  Promoter-based  approaches  were  optimized  across  both  all  CpG  sites  (All  Sites)  and  all 
significant  CpG  sites  (All  Sig.  Sites).  The  minimum  number  of  required  windows  m  was  varied  in  our  method  from  2  to  5  to  show  the  effects  of 
increased  stringency. 


Table  1.  Summary  of  the  numbers  of  genes  identified  by  the  gene  list  tool  for  each  comparison 


Sample  Comparison 

Differentially 

expressed 

genes 

CpGs  with  methylation 
difference  >0.3 
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ADS  -  ADS-iPSC 
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93 
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HI  -  HI -Mesenchymal 

3714 
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40 
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79 

3 

96% 

HI  -  H1-BMP4 
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65 
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2353 
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5 
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H9-IMR90 
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58 

91% 

oocyte  -  ES  cell  (mouse) 

4727 
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25 

93% 

sperm  -  ES  cell  (mouse) 

4580 

4  364748 

1027 
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91% 
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results  for  the  low  coverage  IMR90-H1  comparison, 
which  has  no  minimum  coverage  cutoff,  while  DMR- 
and  promoter-based  methods  struggled  (Figure  6C).  We 
also  ran  experiments  on  the  HMEC-HCC1954  sample  pair 
to  test  the  method’s  performance  on  downsampled  data. 
First,  we  removed  raw  sequence  reads  at  random,  finding 
that  WGBS  data  obtained  with  an  average  coverage  as 
low  as  seven  resulted  in  little  loss  in  our  method’s  ability 
to  identify  genes  (Supplementary  Figure  S24A).  We  also 
removed  methylation  scores  at  random  from  the  mapped 
data,  and  found  that  94.5%  of  genes  were  still  detected 
when  50%  of  the  CpGs  were  removed  (Supplementary 
Figure  S24B).  Additionally,  our  method  outperforms  the 
optimized  DMR-  and  promoter-based  methods  for  the 
MCF7-T47D  sample  pair  collected  using  Methyl- 
MAPS,  which  contained  37.4%  of  the  CpG  sites  from 
the  10  kb  region  centered  at  the  TSS.  These  results 
suggest  that  our  method  is  robust  in  the  context  of 
missing  or  low  coverage  data. 


DISCUSSION 

A  summary  of  the  patterns  discovered  in  the  data  sets 
analyzed  is  presented  in  Figure  3.  One  primary  advance 
of  our  approach  is  in  its  use  of  spatial  information.  Several 
of  the  patterns  we  found  demonstrate  the  need  for  using 
spatial  information  about  specific  CpG  sites  when  trying 
to  connect  differential  methylation  to  expression  change. 
For  instance,  the  LONGO  pattern  is  positively  correlated 
with  expression,  while  the  3/-hypomethylation  pattern, 
commonly  seen  in  conjunction  with  a  S'-hypomethylation 
event  (PROX2i),  is  negatively  correlated.  Given  methyla¬ 
tion  marks  from  one  of  these  two  cases,  it  is  clear  that  the 
spatial  information  about  specific  CpG  locations  is  neces¬ 
sary  to  successfully  determine  the  direction  of  expression 
change.  This  may  help  explain  an  observation  made 
during  an  analysis  of  82  methylation  data  sets  from 
human  tissues  and  cell  lines  using  reduced  representation 
bisulfite  sequencing  (12).  The  authors  observed  that 
methylation  changes  in  CpG  island  sites  greater  than 
2  kb  downstream  were  sometimes  positively  correlated 
with  expression  and  sometimes  negatively  correlated. 
Based  on  our  findings,  one  explanation  for  their  observa¬ 
tion  is  that  they  are  observing  a  mixture  of  genes  with 
LONGO  and  3'  patterns. 

We  also  find  that  despite  variation  throughout  the 
vicinity  of  a  gene’s  promoter,  expression  change  is  often 
well  correlated  with  only  the  methylation  changes  in  a 
confined  area  relative  to  the  TSS.  This  point  is  well  sup¬ 
ported  by  the  many  examples  of  the  3/-proximal  pattern, 
which  is  seen  with  a  variety  of  methylation  activity 
upstream  including  hypermethylation,  hypomethylation 
or  little  activity  (Figure  3B).  DMR-based  methods  are 
generally  tuned  to  find  wider  regions  than  what  we 
observe  in  our  3/-proximal  clusters,  to  better  identify 
genes  with  classical  differential  methylation  at  a  CpG 
island  promoter  (e.g.  TSS2  and  TSS2i).  While  these 
methods  can  be  tuned  to  locate  narrower  DMRs,  this 
would  result  in  more  cases  where  contradictory  DMRs 
exist  near  the  TSS  (e.g.  one  hypermethylated  DMR  and 


one  hypomethylated  DMR,  such  as  PROX3  in  Figure  3B). 
In  these  cases,  there  is  no  obvious  way  to  decide  which 
DMR  may  be  influencing  transcription  without 
introducing  pre-learned  spatial  information.  As  another 
example,  the  DMR  approach  has  difficulty  discriminating 
between  the  LONGO  pattern  that  is  associated  with  a 
decrease  in  expression  and  the  PROXli  and  PROX2i 
patterns  that  are  associated  with  an  increase  in  expression. 
When  set  to  find  long  regions  downstream  of  the  TSS, 
DMRs  can  be  identified  with  positive  correlation 
between  expression  and  methylation.  When  set  to  find 
short  regions,  DMRs  can  be  identified  with  negative  cor¬ 
relations.  However,  no  single  set  of  DMR  parameters  can 
easily  simultaneously  capture  each  differential  methyla¬ 
tion  pattern  and  its  correlation  with  expression.  These 
examples  also  underscore  a  fundamental  limitation  of 
the  DMR  method:  it  cannot  be  used  to  discover  new 
patterns,  but  can  only  search  for  a  limited  set  of  relation¬ 
ships  that  are  already  known  to  exist. 

Spatial  information  has  proven  useful  to  the  analysis  of 
other  epigenetic  data  sets  as  well.  Recently,  a  model  con¬ 
sidering  the  spatial  locations  of  chromatin  marks  was  used 
to  train  a  support  vector  machine  to  predict  chromatin 
signals  at  transcription  factor  binding  sites  (28).  The 
authors  divided  each  fixed  region  into  50  bp  bins,  each 
of  which  was  used  as  a  feature  in  a  vector  for  classifica¬ 
tion.  The  success  of  this  approach  demonstrates  the  po¬ 
tential  of  spatial  models  to  better  capture  the  nature  of  the 
epigenetic  data  than  is  possible  with  a  simple  window- 
based  approach.  However,  for  analyzing  DNA  methyla¬ 
tion,  it  is  important  to  account  for  the  relationship 
between  such  bins  rather  than  treating  them  as  independ¬ 
ent  features.  For  instance,  in  Figure  2C,  the  topmost  and 
bottommost  example  curves  have  a  similar — but  slightly 
shifted — 3/-proximal  decrease  in  methylation  and  are  both 
upregulated.  If  we  used  predefined  nonoverlapping  bins 
here,  they  would  either  be  so  narrow  that  the  top  peak 
and  bottom  peak  lie  in  different  bins,  or  so  wide  that  the 
salient  feature  of  the  curve  is  diminished.  Because  our 
shape-similarity  approach  tolerates  some  movement  of 
peaks  in  the  x-direction,  features  such  as  these  are  natur¬ 
ally  associated  with  one  another. 

Finally,  our  findings  have  implications  for  how  DNA 
methylation  should  be  assayed  to  preserve  all  potentially 
useful  information.  Using  downsampled  WGBS  data  and 
Methyl-MAPS  data  we  demonstrated  that  we  can  discover 
the  full  set  of  patterns  from  Figure  3  without  needing 
values  at  all  CpG  sites,  as  long  as  they  are  removed  in  a 
relatively  unbiased  manner  (Supplementary  Figure  S24, 
Supplementary  Data  3).  However,  information  from 
some  sites  may  be  more  informative  than  others.  Many 
experimental  methods  attempt  to  assess  a  gene’s  methyla¬ 
tion  state  by  restricting  analysis  to  only  CpG-rich  regions, 
or  to  only  a  handful  of  CpG  sites  in  and  around  the 
promoter.  It  is  unclear  whether  such  techniques  measure 
methylation  at  the  correct  sites  to  discriminate  the  full 
spectrum  of  methylation  patterns  that  correlate  with  tran¬ 
scriptional  changes.  As  additional  single-base  resolution 
genome-wide  DNA  methylation  data  sets  become  avail¬ 
able,  we  can  explore  which  subsets  of  individual  CpGs 
most  inform  promoter  methylation  patterns  and  thus 
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need  to  be  experimentally  measured  to  accurately  assess 
changes  in  a  gene’s  methylation  state. 


CONCLUSIONS 

Our  findings  suggest  that  characterizing  gene  promoters 
simply  as  ‘methylated’  or  ‘unmethylated’  is  insufficient.  By 
considering  the  entire  set  of  methylation  changes  near  the 
promoter,  we  found  and  described  a  variety  of  methyla¬ 
tion  patterns  that  correlate  with  expression  change.  The 
power  of  our  method  is  its  ability  to  discover  and  separate 
distinct  patterns  without  any  prior  knowledge  about 
existing  relationships,  which  cannot  be  accomplished 
with  contemporary  approaches.  This  allows  us  to  use  the 
full  potential  of  unbiased  genome-wide  profiling  of  DNA 
methylation  to  reveal  previously  unknown  information 
about  methylation’ s  functional  role.  Although  we 
applied  our  method  on  regions  of  various  widths  around 
the  TSS,  all  correlative  patterns  except  those  associated 
with  the  long  hypomethylated  domains  were  found 
within  5  kb  of  the  TSS  (Figure  5D).  Interestingly,  all 
patterns  found  in  the  cancer  cell  data  sets  were  also 
found  in  the  ES  and  iPSC  cell  data  sets.  While  5' 
upstream  patterns  are  observed,  these  appear  to  occur 
due  to  correlations  with  the  3'  downstream  patterns.  By 
appropriately  capturing  the  diverse  set  of  methylation 
patterns  that  exist,  we  observe  a  high  level  of  association 
between  changes  in  a  gene’s  methylation  state  and  changes 
in  its  expression. 

A  strength  of  the  technique  described  here  is  its  poten¬ 
tial  for  expansion  to  examine  more  general  epigenetic 
modifications.  We  have  confined  our  analysis  to  data 
from  genome-wide  single-base  resolution  methods  such 
as  WGBS  or  Methyl-MAPS.  However,  similar  approaches 
could  be  used  to  analyze  the  relationships  between  ex¬ 
pression  and  other  epigenetic  patterns,  such  as 
5-hydroxymethylcytosine,  non-CpG  methylation,  and 
histone  modifications.  All  expression  data  used  in  this 
study  came  from  single-end  short  read  RNA-Seq  experi¬ 
ments.  If  long-read  paired-end  RNA-Seq  data  were 
available,  it  could  easily  be  used  with  our  method 
to  understand  alternate-promoter  and  isoform-specific 
methylation  patterns.  Considering  the  explosion  of  experi¬ 
ments  aiming  to  profile  epigenetic  landscapes,  this  method 
represents  a  valuable  tool  for  exploring  the  relationships 
between  changes  in  epigenetic  patterns  and  transcription 
both  in  normal  cellular  function  and  in  human  disease. 
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